fix(cudf): Fix UCX output buffer lifetime after async cuDF concat#17533
fix(cudf): Fix UCX output buffer lifetime after async cuDF concat#17533kjmph wants to merge 7 commits into
Conversation
✅ Deploy Preview for meta-velox canceled.
|
|
Hi @kjmph! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
Build Impact AnalysisFull build recommended. Files outside the dependency graph changed:
These directories are not fully covered by the dependency graph. A full build is the safest option. Slow path • Graph generated from PR branch |
CI Failure Analysis
🔴 Linux release with adapters — BUILD Failure View logsBuild errors: The build fails compiling the new test file The offending line in the new test file ( VELOX_CHECK_EQ(
cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking),
cudaSuccess);
Correlation with PR changes:
Known issues:
Reproduce locally: make release # will fail during compilation of CudfVectorTest.cppRecommended fix: Replace // Instead of:
VELOX_CHECK_EQ(
cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking),
cudaSuccess);
// Use:
CUDF_CUDA_TRY(cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking));This is consistent with the |
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
997def3 to
832f70e
Compare
832f70e to
a844318
Compare
UcxPartitionedOutput::flushPending() synchronizes input streams before launching cudf::concatenate(), but concatenate itself is asynchronous on the output stream. When multiple pending inputs were flushed, the operator could clear pendingInputs_ immediately after launching concat, allowing the input CudfVector buffers to be deallocated before the concat kernels finished reading them. This caused nondeterministic GPU result corruption. A TPC-H Q18 correctness loop against a DuckDB reference reproduced the issue regularly, and compute-sanitizer reported a use-after-free with this stack: cudf::concatenate UcxPartitionedOutput::flushPending Fix the lifetime ordering without a CPU-blocking synchronize: * make the concat/output stream wait for all input streams before reading input table views; * prefer rebinding owned cuDF input buffers to the output stream so their stream-ordered deallocation happens after concat work on that stream; * fall back to a single event recorded on the output stream and waited by all input streams when inputs cannot be rebound. This keeps the existing memory behavior of releasing pending input buffers before partitioning, while making their async lifetime ordering explicit. Validated by: * TPC-H Q18 GPU correctness loop against stored DuckDB reference * compute-sanitizer with stream-ordered race tracking, confirming the original concat use-after-free is no longer reported
Update the cuDF vector deallocation ordering helpers to accept std::span instead of concrete std::vector references. These helpers only need read-only views over CudfVector pointers and CUDA streams.
a844318 to
a891a35
Compare
CudfVector::rebindStream returned early when the vector's logical stream already matched the target stream. That misses packed_table storage, where gpu_data may still have a different deallocation stream after an intra-node UCX handoff. Update packed_table storage by setting the packed buffer deallocation stream directly. For owned cudf::table storage, rely on the current cuDF rebind_stream API and remove the old compatibility shim. Add CudfVector tests that verify both owned table and packed table buffers are deallocated on the rebound stream.
When CudfVector buffers cannot be rebound to the output stream, orderCudfVectorDeallocationsAfterStream falls back to recording an event on the output stream and making the input streams wait on it. Reuse a per-thread, per-device CUDA event for this fallback instead of creating and destroying a new event on each call. Intentionally leak the event to avoid CUDA calls from thread-local destructors after CUDA context teardown, matching libcudf's stream synchronization pattern.
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Use cuda_async_memory_resource in the CudfVector rebind tests so the instrumented resource records stream-ordered deallocation behavior. cuda_memory_resource ignores the stream and frees via cudaFree, so it only verified the wrapper-observed stream. The async resource makes the test match the lifetime ordering that CudfVector::rebindStream is intended to control.
UcxPartitionedOutput::flushPending() synchronizes input streams before launching cudf::concatenate(), but concatenate itself is asynchronous on the output stream. When multiple pending inputs were flushed, the operator could clear pendingInputs_ immediately after launching concat, allowing the input CudfVector buffers to be deallocated before the concat kernels finished reading them.
This caused nondeterministic GPU result corruption. A TPC-H Q18 correctness loop against a DuckDB reference reproduced the issue regularly, and compute-sanitizer reported a use-after-free with this stack:
cudf::concatenate
UcxPartitionedOutput::flushPending
Fix the lifetime ordering without a CPU-blocking synchronize:
This keeps the existing memory behavior of releasing pending input buffers before partitioning, while making their async lifetime ordering explicit.
Validated by: